Defining Data Clusters

Author

  • Ali Jadbabaie
Abstract

Clustering has been extensively studied in data analysis. However, clustering theory has so far failed to justify the use of the many proposed algorithms. A fundamental question is this: How do we define a cluster in a set of data points? In "A Mathematical Theory for Clustering in Metric Spaces," Cheng-Shang Chang and his colleagues attempt to answer this question by considering a set of data points associated with a distance measure, or metric (IEEE Trans. Network Science and Eng., vol. 3, no. 1, 2016, pp. 2–16). They propose a new cohesion measure with regard to the distance measure. Using this cohesion measure, they define a cluster as a set of points cohesive to themselves, and provide various equivalent statements with intuitive explanations to support their definition.

The authors then consider a second question: How do we use this definition to find clusters and good partitions of clusters? Chang and his colleagues offer two algorithms: hierarchical agglomerative and partitional. Unlike standard hierarchical agglomerative algorithms, their algorithm stops specifically at partitions of clusters. Their partitional algorithm, dubbed the K-sets algorithm, is quite novel and interesting. Unlike the K-means algorithm (for data points in a Euclidean space), which assigns points to the closest centroid, the K-sets algorithm assigns points to the closest set with regard to the triangular distance: the average difference of the three sides of a triangle formed by the point and two randomly selected points in the set. As such, the K-sets algorithm performs well when a cluster can't be represented by a single point.

The authors also find that the duality between a distance measure and a cohesion measure leads to a dual K-sets algorithm for clustering a set of data points with a cohesion measure. The dual K-sets algorithm converges in the same way as a sequential version of the classical kernel K-means algorithm. The key difference is that a cohesion measure doesn't have to be positive semidefinite.
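The assignment rule described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that "the average difference of the three sides" means the average of d(x, z1) + d(x, z2) - d(z1, z2) over all ordered pairs (z1, z2) drawn from the set, which is the expected value when two points are selected at random with replacement. The function names `triangular_distance` and `k_sets`, the sweep limit, and the tie-breaking rule (a point moves only on strict improvement) are hypothetical choices made for illustration.

```python
def triangular_distance(x, members, dist):
    """Average of d(x,z1) + d(x,z2) - d(z1,z2) over all ordered pairs in members.

    Assumed reading of the abstract's "triangular distance"; dist is a full
    symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(members)
    total = sum(
        dist[x][z1] + dist[x][z2] - dist[z1][z2]
        for z1 in members
        for z2 in members
    )
    return total / (n * n)


def k_sets(dist, labels, max_sweeps=100):
    """Sequentially reassign each point to the set closest in triangular distance.

    labels is an initial assignment with values 0..K-1. Sweeps repeat until
    no point changes its set (or max_sweeps is reached).
    """
    labels = list(labels)
    k = max(labels) + 1
    for _ in range(max_sweeps):
        changed = False
        for x in range(len(labels)):
            # Start from the point's current set, which always contains x.
            own = [i for i, lab in enumerate(labels) if lab == labels[x]]
            best_label = labels[x]
            best_value = triangular_distance(x, own, dist)
            for c in range(k):
                if c == labels[x]:
                    continue
                members = [i for i, lab in enumerate(labels) if lab == c]
                if not members:  # skip empty sets
                    continue
                value = triangular_distance(x, members, dist)
                if value < best_value:  # move only on strict improvement
                    best_label, best_value = c, value
            if best_label != labels[x]:
                labels[x] = best_label
                changed = True
        if not changed:
            break
    return labels


# Usage: six points on a line, deliberately mixed in the initial assignment.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(k_sets(dist, [0, 1, 0, 1, 0, 1]))  # prints [0, 0, 0, 1, 1, 1]
```

Only pairwise distances enter the computation, so no centroid is ever formed. That is consistent with the abstract's remark that the approach suits clusters that a single representative point cannot describe.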

Similar resources

Analyzing Data Clusters: A Rough Sets Approach to Extract Cluster-Defining Symbolic Rules

In this paper we present a strategy together with its computational implementation to intelligently analyze data clusters in terms of symbolic cluster-defining rules. We present a symbolic rule extraction workbench that leverages rough set theory to inductively extract CNF form symbolic rules from un-annotated continuous-valued data-vectors. Our workbench purports a hybrid rule extraction metho...

Recovery Rate of Clustering Algorithms

This article provides a simple and general way for defining the recovery rate of clustering algorithms using a given family of old clusters for evaluating the performance of the algorithm when calculating a family of new clusters. Under the assumption of dealing with simulated data (i.e., known old clusters), the recovery rate is calculated using one proposed exact (but slow) algorithm, or one ...

A Comprehensive Study of Challenges and Approaches for Clustering High Dimensional Data

Clustering is one of the most effective methods for summarizing and analyzing datasets, which are collections of data objects that may be similar or dissimilar in nature. Clustering aims at finding groups, or clusters, of objects with similar attributes. Most clustering methods work efficiently for low dimensional data since distance measures are used to find dissimilarities between objects. High dimensional ...

Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members

Graphs have many applications in real-world problems. When we deal with a huge volume of data, analyzing the data is difficult or sometimes impossible. In big data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for clustering graphs, but it gives us no way to select the number of clusters and the number of members ...

Improved Parameterless K-Means: Auto-Generation Centroids and Distance Data Point Clusters

K-means is an unsupervised learning and partitioning clustering algorithm. It is popular and widely used for its simplicity and speed. K-means clustering produces a number of separate flat (non-hierarchical) clusters and is suitable for generating globular clusters. The main drawback of the k-means algorithm is that the user must specify the number of clusters in advance. This paper presents an ...

Adaptations for finding irregularly shaped disease clusters

BACKGROUND Recent adaptations of the spatial scan approach to detecting disease clusters have addressed the problem of finding clusters that occur in non-compact and non-circular shapes, such as along roads or river networks. Some of these approaches may have difficulty defining cluster boundaries precisely, and tend to over-fit data with very irregular (and implausible) cluster shapes. RESU...

Journal:
  • IEEE Computer

Volume: 49

Publication date: 2016